Abstract — Inspired by biological vision systems, over-complete local features with huge cardinality have been increasingly used for face recognition over the last decades. Accordingly, feature selection has become more and more important and plays a critical role in face data description and recognition. In this paper, we propose a trainable feature selection algorithm for face recognition based on a regularization framework. By enforcing a sparsity penalty term on the minimum squared error (MSE) criterion, we cast the feature selection problem into a combinatorial sparse approximation problem, which can be solved by greedy methods or convex relaxation methods. Moreover, based on the same framework, we propose a sparse Ho-Kashyap (HK) procedure to obtain simultaneously the optimal sparse solution and the corresponding margin vector of the MSE criterion. The proposed methods are used to select the most informative Gabor features of face images for recognition, and experimental results on benchmark face databases demonstrate their effectiveness.

Index Terms — Face recognition, feature selection, sparse approximation, minimum squared error criterion, Ho-Kashyap 
procedure. 



1 Introduction 

Within the last several decades, face recognition has received extensive attention due to its wide range of applications, from identity authentication, access control and surveillance to human-computer interaction, and numerous novel face recognition algorithms have been proposed [43], [28]. One of the key issues for successful face recognition systems is the development of an effective face representation, namely how to extract and select discriminative features to represent a face image. According to the region from which features are derived, face representation methods can generally be divided into two categories: holistic representation and local representation. Holistic representation extracts features from the whole face image, while local representation computes features from local facial regions.

After the introduction of the well-known Eigenfaces [33], holistic representation methods were extensively studied (e.g., [13], [37], [34]). However, local areas are often more descriptive and more appropriate for dealing with facial variations due to expression, partial occlusion and illumination, since most variations in appearance affect only a small part of the face region. Local feature analysis (LFA) pioneered the study of local representation for face recognition. Recently, local representation approaches have received more attention and have shown more promising results (e.g., [36], [38]). Many local feature descriptors, such as Haar-like features [16], SIFT features [5], histograms of oriented gradients (HOG) [2], edge orientation histograms (EOH) [39], Gabor features [36], local binary patterns (LBP), bio-inspired features [23] and learned descriptors [6], have been successfully applied in face recognition. These local features are often over-complete, whereas only a relatively small fraction of them is relevant to the recognition task. Thus feature selection is a crucial and necessary step to select the most discriminant local features to obtain a sparse
face representation. Although prior knowledge for choosing this local feature dictionary of large cardinality is often limited and a consistent theory is still missing, numerous learning-based methods are emerging in practice due to their effectiveness (refer to [12] for an excellent review of feature selection approaches in machine learning). Adaboost-based methods are the most popular and impressive feature selection methods in the face recognition scenario (e.g., [16], [42], [27], [26]). One possible problem of these methods is that the training stage is very time consuming, because a classifier must be trained and evaluated for each feature component. An alternative is the regularization-based approach, which sparsifies with respect to a dictionary of features by sparsity-enforcing regularization techniques [29], [18]. The main merits of such a regularized approach are its effectiveness even with a very small number of data and the fact that it is supported by well-grounded theory [8]. Another potential merit is that regularized methods analyze all feature components together and may be more appropriate for capturing groups of correlated features, whereas Adaboost-based methods consider the relevance of each feature separately and thus may ignore possible dependencies between features.

Based on the regularization framework, in this paper we propose a novel feature selection method based on the classical minimum squared error (MSE) criterion [9], which assumes a linear dependence between the feature components and the discriminant functions. We cast the feature selection problem into a combinatorial sparse approximation problem by enforcing a sparsity penalty term on the MSE criterion; the solution can be obtained by greedy methods such as matching pursuit (MP) [21] or orthogonal matching pursuit (OMP) [30], [31], [32], or by convex relaxation methods [31]. We restrict ourselves to linear models because they are relatively easy to compute and, in the absence of information suggesting otherwise, linear models are attractive candidates. Further, the linear model can be extended to nonlinear cases by explicitly or implicitly applying some function to the local feature components. The latter is the well-known kernel trick.

Due to the arbitrary selection of the margin vector, the MSE procedure cannot guarantee that the optimal separating vector is obtained, even in the separable case [9]. We impose sparsity constraints on the Ho-Kashyap (HK) procedure [9] and propose a sparse HK (SHK) procedure to obtain simultaneously the optimal sparse solution and the corresponding margin vector. Similar to the original HK procedure, the proposed SHK procedure is an iterative scheme that alternates between solving for the sparse vector based on the current margin vector and updating the margin vector. It is flexible and can work with any greedy method or convex relaxation method.

Gabor and LBP are the two most representative local features in face recognition. We select the Gabor feature as the starting representation due to its particular ability to model the spatial summation properties of the receptive fields of the so-called "bar cells" in the primary visual cortex. We then apply the proposed feature selection method to select the most informative Gabor features for face recognition. Experimental results on benchmark face databases demonstrate the effectiveness of the proposed feature selection methods.

Our method is mainly inspired by the work in [8], which also applies a sparse regularization term to a linear model to perform feature selection. Nevertheless, the linear model in [8] neglects the bias on the one hand and only enforces a linear dependence between the feature components and the class labels on the other hand. In fact, this simple linear dependence is equivalent to setting all entries of the margin vector equal to 1 in the MSE criterion model. Our method starts from the MSE criterion and considers simultaneously the bias and an adaptive margin vector, and hence can be seen as a generalization of the method in [8]. Moreover, in [8] the sparse solution is obtained through an iterative soft-thresholding method whose convergence relies on careful normalization of each feature component over all training samples, which may destroy the structure of the features. Our method adheres to the original features without any additional normalization and still converges.

The rest of this paper is organized as follows. In Section 2 we start from the MSE criterion and propose a novel feature selection method based on sparsity-enforcing regularization techniques. Based on the same framework, in Section 3 we present a sparse extension of the classical HK procedure for feature selection. In Section 4 we first briefly review the Gabor face representation and then illustrate how to apply the proposed feature selection framework to select the most informative Gabor features for face recognition. Experiments and analysis are described in Section 5, and Section 6 concludes the paper.

2 Feature selection based on sparse MSE criterion

In this section, we present a new feature selection algorithm based on the MSE criterion. As mentioned before, we restrict ourselves to discriminant functions that are linear in the components of the feature vector $x = [x_1, \ldots, x_d]^T$:

$$g(x) = \sum_{i=1}^{d} \omega_i x_i + \omega_0 = y^T a, \tag{1}$$

where $g(x)$ denotes the discriminant function; $\omega_0$ is the bias or threshold; $\omega_i$ ($i = 1, \ldots, d$) are the weights; and $y = [1, x_1, \ldots, x_d]^T$ and $a = [\omega_0, \ldots, \omega_d]^T$ are the augmented feature vector and augmented weight vector, respectively.
Since face recognition can be cast as a classification of intra-personal and extra-personal variations [24], we focus on a binary classification problem. As suggested in [19], we replace all negative samples (i.e., extra-personal variations) with their negatives so that the labels can be dropped, and look for a weight vector such that $y_i^T a > 0$ for all samples. This relation is invariant under a positive scaling of $a$. Thus, we can define a canonical hyperplane such that $y_i^T a = b_i$, where $b_i$ is a positive constant called the margin. The problem can then be reformulated as the following linear system of equations:

$$Ya = b, \tag{2}$$

where $Y = [y_1, \ldots, y_n]^T \in \mathbb{R}^{n \times (d+1)}$ is the augmented feature matrix and $b = [b_1, \ldots, b_n]^T$ is the margin vector. Due to the size of $Y$, it is infeasible to obtain an exact solution of (2). One classical relaxation is to minimize the squared error criterion function

$$\min_a \|Ya - b\|_2^2 = \min_a \sum_{i=1}^{n} (y_i^T a - b_i)^2. \tag{3}$$

It can be solved by a gradient search procedure. However, the MSE solution does not provide feature selection, because it is typically non-sparse. By adding a sparse regularization term to the MSE criterion, we turn feature selection into solving the following sparsity-enforcing MSE (SMSE) criterion:

$$\min_a \|Ya - b\|_2^2 + \tau^2 \|a\|_0, \tag{4}$$

where $\|\cdot\|_0$ is the $\ell_0$ quasi-norm counting the nonzero entries of a vector and $\tau$ is a threshold that quantifies how much improvement in the approximation error is necessary before we admit an additional term into the approximation.
It is a classic combinatorial sparse approximation problem and can be solved by greedy techniques such as MP and OMP, which construct a sparse approximant one step at a time by selecting the atom most strongly correlated with the residual part of the signal and using it to update the current approximation. An alternative for solving the SMSE criterion (4) is convex relaxation, which replaces the problem with a relaxed version that can be solved more efficiently. The $\ell_1$ norm provides a natural convex relaxation of the $\ell_0$ quasi-norm, which suggests that we may be able to solve sparse approximation problems by introducing an $\ell_1$ norm in place of the $\ell_0$ quasi-norm. From this heuristic, a relaxed version of the SMSE (RSMSE) criterion can be derived as follows:

$$\min_a \frac{1}{2}\|Ya - b\|_2^2 + \gamma \|a\|_1, \tag{5}$$

which is an unconstrained convex problem, so standard mathematical programming software can be used to find a minimizer. The parameter $\gamma$ negotiates a compromise between approximation error and sparsity. It has been proved that if the feature matrix $Y$ is incoherent and the threshold parameters are correctly chosen, then the solution to the RSMSE criterion (5) identifies every significant atom from the solution to the SMSE criterion (4) and no others.
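As an illustration of how the relaxed criterion (5) might be minimized, the following minimal sketch uses iterative soft-thresholding (ISTA). It is not the solver adopted in the experiments, which use OMP on criterion (4). The function names `soft_threshold` and `ista`, the fixed step size $1/\|Y\|_2^2$ and the iteration count are assumptions made here for illustration.

```python
import numpy as np

def soft_threshold(v, thr):
    """Shrink every entry of v towards zero by thr (proximal map of the l1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def ista(Y, b, gamma, n_iter=500):
    """Approximately minimize 0.5 * ||Y a - b||_2^2 + gamma * ||a||_1.

    Hypothetical helper: the step size is 1 / L, where L = ||Y||_2^2 is the
    Lipschitz constant of the gradient of the quadratic term.
    """
    step = 1.0 / np.linalg.norm(Y, 2) ** 2
    a = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        grad = Y.T @ (Y @ a - b)                       # gradient of the quadratic term
        a = soft_threshold(a - step * grad, step * gamma)
    return a
```

Larger values of `gamma` drive more entries of `a` exactly to zero, which is how the parameter trades approximation error against sparsity.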

For run-time reasons, we adopt OMP to solve the SMSE criterion (4). Since OMP is iterative, we must supply a criterion for stopping the iteration. Here are two possibilities:

• One may halt the procedure when the norm of the residual declines below a specified threshold.

• One may halt the procedure after a predefined number of distinct feature components have been selected.

In our implementation, the iteration is stopped whenever one of the above conditions is satisfied.
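The greedy selection and the two stopping rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `omp`, the parameters `max_features` and `tol`, and the normalization of correlations by column norms are assumptions introduced here.

```python
import numpy as np

def omp(Y, b, max_features, tol=1e-6):
    """Greedy sparse approximation of Y a ~ b by orthogonal matching pursuit.

    Stops when the residual norm drops below `tol` or when `max_features`
    distinct columns have been selected, whichever happens first.
    """
    b = b.astype(float)
    col_norms = np.linalg.norm(Y, axis=0)
    col_norms[col_norms == 0] = 1.0          # guard against empty columns
    a = np.zeros(Y.shape[1])
    support, coef = [], np.zeros(0)
    residual = b.copy()
    while len(support) < max_features and np.linalg.norm(residual) > tol:
        # Select the column most strongly correlated with the residual.
        corr = np.abs(Y.T @ residual) / col_norms
        if support:
            corr[support] = -np.inf          # never reselect a column
        support.append(int(np.argmax(corr)))
        # Orthogonal step: refit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(Y[:, support], b, rcond=None)
        residual = b - Y[:, support] @ coef
    if support:
        a[support] = coef
    return a, support
```

The nonzero positions of the returned vector (equivalently, `support`) are the selected feature components; position 0 corresponds to the bias column of the augmented matrix.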

Notice that in criteria (3), (4) and (5), the entries of the margin vector $b$ are arbitrary positive constants. Obviously, different choices of $b$ will typically lead to different solutions. As the MSE solution is directly related to the Fisher discriminant vector with a proper choice of the margin vector (i.e., the entries $b_i$ corresponding to the same class are equal to the ratio of the sample size of this class to the total sample size) [9], the SMSE or RSMSE solution gives a natural sparse generalization of the Fisher linear discriminant. Hereafter we refer to the resulting feature selection algorithm as the sparse Fisher (SFisher) procedure. Moreover, if we set $b = 1_n$ and $\omega_0 = 0$ (we refer to the resulting algorithm as the simplified SMSE procedure, or SSMES), the RSMSE criterion (5) degenerates into the linear model described in [8], and thus our method can also be seen as a generalization of the method in [8].

3 Feature selection based on sparse HK procedure

Because the objective is to minimize $\|Ya - b\|_2^2$, as discussed in [9], the MSE procedure yields a solution whether or not the samples are linearly separable, but there is no guarantee that this vector is a separating vector, even in the separable case. However, in the separable case there does exist a margin vector $b$ with all positive entries such that the corresponding MSE solution is a separating vector. The HK procedure extends the MSE procedure to deal with this problem by determining $a$ and $b$ alternately, where the components of $b$ cannot decrease. Borrowing the same idea, in this section we propose a sparse version of the HK procedure to extend the method described in the previous section. Specifically, each iteration of the proposed SHK procedure has two stages: a sparse approximation stage that evaluates $a$, and a stage that updates the margin vector $b$. Sparse approximation can be conveniently performed by using greedy or convex relaxation algorithms to solve the SMSE criterion (4) with a given $b$. As in the original HK procedure, the updating rule for $b$ is to start with $b > 0$¹ and to refuse to reduce any of its components, namely

$$\begin{cases} b(1) > 0, \\ b(t+1) = b(t) + 2\eta(t)\, e^{+}(t), \end{cases} \tag{6}$$

where $\eta(t)$ is a positive scale factor, or learning rate; $e(t) = Ya(t) - b(t)$ is the error vector; and $e^{+}(t) = \frac{1}{2}(e(t) + |e(t)|)$ is the positive part of the error vector. Given some stopping rule, our algorithm is:
Algorithm SHK.

Initialization: Set $b(1) > 0$ and $0 < \eta(\cdot) < 1$. Set the iteration index $t = 1$.

Repeat until convergence (stopping criterion):

• Sparse approximation stage: Use any greedy algorithm or convex relaxation method to compute $a(t)$ by approximating the solution of the SMSE criterion (4) with $b = b(t)$.

• Margin vector update stage: $b(t+1) = b(t) + 2\eta(t)\, e^{+}(t)$.

• Set $t = t + 1$.

In our implementation, the loop is terminated when $\|b(t+1) - b(t)\|_2 < \epsilon$ is reached.

It is noteworthy that although the convergence of the original HK procedure can be proven theoretically [9], the introduction of the sparse approximation stage makes an exact, deterministic analysis of the convergence of the proposed SHK algorithm rather complicated or even impossible. Nevertheless, we can obtain convergence by careful selection of $\eta(t)$, which decreases with $t$. Our choice is to set $\eta(t) = \ldots$

1. $b > 0$ means that every component of $b$ is positive.
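The alternation between the sparse approximation stage and the margin update (6) can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' code: the helper `omp_solve`, the initial margin $b(1) = 1_n$, the sparsity level `k` (which must be at least 1), and in particular the schedule `eta0 / t` are hypothetical choices; the decreasing schedule actually used in the paper is not recoverable from the text.

```python
import numpy as np

def omp_solve(Y, b, k):
    """Hypothetical helper: k-sparse least-squares fit of Y a ~ b via OMP (k >= 1)."""
    a = np.zeros(Y.shape[1])
    support, residual = [], b.copy()
    norms = np.linalg.norm(Y, axis=0)
    norms[norms == 0] = 1.0
    for _ in range(k):
        corr = np.abs(Y.T @ residual) / norms
        if support:
            corr[support] = -np.inf          # do not reselect a column
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(Y[:, support], b, rcond=None)
        residual = b - Y[:, support] @ coef
    a[support] = coef
    return a

def sparse_hk(Y, k, eta0=0.5, eps=1e-4, max_iter=200):
    """Sketch of the SHK iteration: alternate sparse fitting and margin update."""
    b = np.ones(Y.shape[0])                  # b(1) > 0 (assumed initial value)
    a = np.zeros(Y.shape[1])
    for t in range(1, max_iter + 1):
        a = omp_solve(Y, b, k)               # sparse approximation stage
        e = Y @ a - b                        # error vector e(t)
        e_plus = 0.5 * (e + np.abs(e))       # positive part of the error
        eta = eta0 / t                       # ASSUMED decreasing rate in (0, 1)
        b_next = b + 2.0 * eta * e_plus      # margin update; entries never decrease
        converged = np.linalg.norm(b_next - b) < eps
        b = b_next
        if converged:
            break
    return a, b
```

Because only the positive part of the error is added, no component of the margin vector can fall below its initial value, which mirrors the "refuse to reduce" rule of the HK procedure.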

4 Gabor feature selection for face recognition

In this section we describe how we specialize the proposed feature selection framework to the case of face recognition. We first briefly review the Gabor representation of faces and then describe how to apply the proposed feature selection methods to select Gabor features for face recognition.

4.1 Gabor representation 

We start with the widely used Gabor representation because the kernels of Gabor filters are similar to the 2D receptive field profiles of mammalian cortical simple cells and exhibit desirable characteristics of spatial locality and orientation selectivity (e.g., [36]). The Gabor representation of a face image can be obtained by convolving the image with a set of Gabor filters, which are commonly defined as follows:

$$\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \exp\!\left(-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp(i\, k_{\mu,\nu} \cdot z) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right], \tag{7}$$

where $z$ is the coordinate vector; the parameters $\mu$ and $\nu$ define the orientation and the scale of the Gabor filter; the parameter $\sigma$ is the standard deviation of the Gaussian window; and $k_{\mu,\nu}$ is the wave vector given by $k_{\mu,\nu} = k_\nu e^{i\phi_\mu}$, where $k_\nu = k_{\max}/f^{\nu}$ and $\phi_\mu = \pi\mu/8$ if eight different orientations have been chosen; $k_{\max}$ is the maximum frequency, and $f$ is the spacing factor between kernels in the frequency domain. In face recognition, researchers commonly use 40 Gabor filters with five scales $\nu \in \{0, \ldots, 4\}$ and eight orientations $\mu \in \{0, \ldots, 7\}$, with $\sigma = 2\pi$, $k_{\max} = \pi/2$ and $f = \sqrt{2}$. However, we let the scale range from -1 to 2 rather than from 0 to 4 because smaller face images are used in our experiments. Thus only 32 Gabor filters are used. Convolving the face image with these 32 Gabor filters and extracting only the magnitude information generates a high-dimensional Gabor representation. For example, for an image with 64 × 64 pixels, the total number of Gabor features is 4 × 8 × 64 × 64 = 131,072.



A notable problem in discrete convolution is the choice of a proper size for the convolution mask. It should be large enough to show the nature of the Gabor kernels, but not so large that computation becomes inefficient. As suggested in [10], we truncate the Gabor filters to six times the span of the Gaussian function. As the span of the Gaussian function is $\sigma / k_\nu$, the Gabor mask is truncated to a width $w = 6\sigma / k_\nu + 1 = 24 \times 2^{\nu/2} + 1$. Thus, in our experiments the sizes of the Gabor filters are 19 × 19, 25 × 25, 35 × 35 and 49 × 49, corresponding to the scales $\nu \in \{-1, \ldots, 2\}$.
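The mask sizes quoted above can be reproduced with a short sketch. It assumes the commonly used Gabor kernel form with $\sigma = 2\pi$, $k_{\max} = \pi/2$ and $f = \sqrt{2}$, and it assumes that the width $6\sigma/k_\nu + 1$ is rounded up to the next odd integer, a rounding rule that is not stated in the text but yields the sizes 19, 25, 35 and 49. The names `mask_width` and `gabor_kernel` are hypothetical.

```python
import numpy as np

SIGMA = 2 * np.pi      # standard deviation of the Gaussian window
K_MAX = np.pi / 2      # maximum frequency
F = np.sqrt(2)         # spacing factor between kernels in the frequency domain

def mask_width(nu):
    """Mask width 6 * sigma / k_nu + 1, rounded up to an odd integer (assumed rule)."""
    k_nu = K_MAX / F ** nu
    w = int(np.ceil(np.round(6 * SIGMA / k_nu + 1, 9)))  # round() absorbs float noise
    return w if w % 2 == 1 else w + 1

def gabor_kernel(mu, nu):
    """Complex Gabor kernel for orientation mu in 0..7 and scale nu in -1..2."""
    k_nu = K_MAX / F ** nu
    phi = np.pi * mu / 8
    kx, ky = k_nu * np.cos(phi), k_nu * np.sin(phi)       # wave vector components
    half = mask_width(nu) // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    k2, z2 = k_nu ** 2, xs ** 2 + ys ** 2
    envelope = (k2 / SIGMA ** 2) * np.exp(-k2 * z2 / (2 * SIGMA ** 2))
    carrier = np.exp(1j * (kx * xs + ky * ys)) - np.exp(-SIGMA ** 2 / 2)
    return envelope * carrier

widths = [mask_width(nu) for nu in (-1, 0, 1, 2)]
n_features = 4 * 8 * 64 * 64   # 4 scales x 8 orientations x 64 x 64 pixels
```

Taking `np.abs` of the convolution of an image with each of the 32 kernels would give the magnitude features described in this section.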

4.2 Feature selection for face recognition 

We now turn our attention to feature selection for the high-dimensional Gabor representation. Similarly to Moghaddam et al. [24], we temporarily cast face recognition as a classification of intra-personal (hereafter positive) and extra-personal (hereafter negative) variations. For each pair of face images, we compare the corresponding Gabor feature components. Specifically, for each pair of input images we obtain a feature vector $x_i$ whose elements are the absolute differences between the corresponding Gabor representations. Given a training set, we can then build the augmented feature matrix $Y$ following the routine described in Section 2.
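A minimal sketch of this construction is given below: each image pair contributes one row made of a leading 1 followed by the absolute feature differences, and rows from extra-personal pairs are negated as described in Section 2. The function name `build_augmented_matrix` and its argument layout are assumptions made for illustration.

```python
import numpy as np

def build_augmented_matrix(features, labels, pairs):
    """Build Y from pairwise absolute differences of feature vectors.

    features: (num_images, d) array of Gabor features, one row per image.
    labels:   subject identity of each image.
    pairs:    iterable of (i, j) image index pairs.
    """
    rows = []
    for i, j in pairs:
        x = np.abs(features[i] - features[j])      # difference feature vector
        y = np.concatenate(([1.0], x))             # augmented vector [1, x]
        if labels[i] != labels[j]:
            y = -y                                 # negate extra-personal samples
        rows.append(y)
    return np.vstack(rows)
```

With this convention, a weight vector separates the two classes exactly when every entry of `Y @ a` is positive.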

A non-negligible practical problem is the overwhelming size and imbalance of the training samples [16]. For instance, given a training set that includes $K$ images for each of $C$ individuals, the total number of image pairs is $\binom{CK}{2}$, whereas only a small minority, $C\binom{K}{2}$, of these pairs display intra-personal variation. Let $C = 300$ and $K = 4$; then the numbers of positive and negative samples are 1,800 and 717,600, respectively, with a ratio close to 1 : 400.
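The counts above follow directly from the two binomial expressions; a small check is sketched here, with `pair_counts` being a hypothetical helper name.

```python
from math import comb

def pair_counts(num_subjects, images_per_subject):
    """Return (intra-personal, extra-personal) pair counts for a training set."""
    total = comb(num_subjects * images_per_subject, 2)      # all image pairs
    positive = num_subjects * comb(images_per_subject, 2)   # same-subject pairs
    return positive, total - positive

positive, negative = pair_counts(300, 4)
ratio = negative / positive   # extra-personal pairs per intra-personal pair
```

For 300 subjects with 4 images each, there are 719,400 pairs in total, of which 1,800 are intra-personal, so the ratio is just under 400.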

Obviously, such a huge sample size leads to severe memory and computational problems. In addition, the imbalanced training samples may bias the feature selection. In order to obtain balanced systems of reasonable size, we randomly sample the positive and negative samples at a comparable ratio to build the augmented feature matrix $Y$. In practice, we sample the negative samples while keeping all positive samples, with the ratio of positive to negative samples varying from 1 : 1 to 1 : 10.

Once we have built the augmented feature matrix $Y$, we can find the solution of the SMSE criterion (4) with a given or an adaptive margin vector $b$ according to the procedures described previously. The Gabor feature components corresponding to the nonzero entries of the augmented weight vector $a$ are then selected as the most informative ones and used for face recognition.

Recall that the above feature selection framework, being based on linear discriminant functions, also establishes a linear classifier with bias $\omega_0$ that discriminates between intra-personal and extra-personal differences, so it can be used for face recognition directly. However, one can also use it purely as a feature selection tool to reduce the number of Gabor features and adopt other common classifiers, such as the nearest neighbor classifier (NNC), the Fisher classifier (FC) [9] or support vector machines (SVM), for recognition.

5 Experiments and results

In order to evaluate the proposed approach, we carry out experiments on two large face databases: the CAS-PEAL-R1 and LFW face databases. The CAS-PEAL-R1 face database contains 30,863 images of 1,040 Chinese subjects with different variations of pose, expression, accessories, age, and lighting. The LFW face database contains 13,233 labeled face images collected from news sites on the Internet. These images belong to 5,749 different individuals and have high variations in position, pose, lighting, background, camera and quality. Therefore, the LFW database is more appropriate for evaluating face recognition methods in realistic and unconstrained environments.

In all our experiments, each image is rotated and scaled so that the centers of the eyes are placed on specific pixels, and is then cropped to 64 × 64 pixels.² As described before, we use only 32 Gabor filters to extract the Gabor features.

2. The eye locations are provided with the CAS-PEAL-R1 database. For the LFW database, we used a standard fiducial point detector to extract the eye locations and annotated them manually whenever the automatic eye locator failed.



5.1 Results on CAS-PEAL-R1 database 

We strictly follow the CAS-PEAL-R1 evaluation protocol, which specifies one training set, one gallery set and six probe sets [11]. The training set therefore includes 1,200 images of 300 subjects, and the ratio of intra-personal to extra-personal sample size is 1,800 : 717,600. We keep all intra-personal samples while randomly sampling the extra-personal samples at a ratio of 1 : 7. If all Gabor features are considered, the linear problem to be built is rather large: the size of the augmented feature matrix $Y$ comes to 14,400 × 131,073. Obviously, direct multiplication with such a matrix is infeasible. One possible choice is to reduce the number of Gabor features. Given the prior knowledge that the magnitude of the Gabor filter responses is not sensitive to position, we can reduce the number of positions by a simple down-sampling scheme with a factor of 16, so that the number of positions is roughly one sixteenth of the total number of pixels. After down-sampling, the size of the augmented feature matrix $Y$ is reduced to 14,400 × 8,193.
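The matrix sizes in this paragraph can be checked with the following sketch. The helper name `augmented_matrix_shape` is hypothetical, and the shape is reported as (samples, features + 1), following the definition of $Y$ in Section 2.

```python
def augmented_matrix_shape(n_pos, neg_per_pos, n_filters, height, width, downsample=1):
    """Shape of Y for a given sampling ratio and position down-sampling factor."""
    n_samples = n_pos * (1 + neg_per_pos)                    # positives plus sampled negatives
    n_features = n_filters * (height * width // downsample)  # Gabor magnitudes kept
    return n_samples, n_features + 1                         # +1 for the bias column

full = augmented_matrix_shape(1800, 7, 32, 64, 64)
reduced = augmented_matrix_shape(1800, 7, 32, 64, 64, downsample=16)
full_bytes = full[0] * full[1] * 8   # storage if held as 64-bit floats (an assumption)
```

Stored densely as 64-bit floats, the full matrix would need roughly 15.1 GB, which illustrates why the down-sampled version is used.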

5.1.1 Feature selection results 

We conducted experiments on the CAS-PEAL-R1 training set using the SSMES, SFisher and SHK procedures to select the 500 most informative Gabor features, respectively. Their characteristics can be observed from their statistics. The location distribution of the selected Gabor features is shown in Fig. 1. It is interesting to see that most of the selected Gabor features from all three methods are located around the prominent facial features such as the eyebrows, eyes, nose and mouth, while few are located in the cheek area. This indicates that the regions of prominent facial features carry the most important discriminating information, while the cheek region conveys less. Moreover, a minority of the selected features are located on external features such as the cheek contour and jaw line. Although the external region covers little of the face, the external features implicitly use shape information and thus are useful for distinguishing thin faces from round faces. This result agrees with [41].

Fig. 1. Location distribution of the first 500 selected Gabor features on the CAS-PEAL-R1 dataset. Panels: SSMES, SFisher, SHK.

Fig. 2. Distribution of the 32 Gabor kernels in the first 500 leading Gabor features on the CAS-PEAL-R1 dataset (horizontal axis: Gabor kernels). (a) SSMES. (b) SFisher. (c) SHK.

We also compared the frequency of Gabor kernels among the selected Gabor features. Fig. 2 illustrates the frequency of the 32 Gabor kernels in the leading 500 Gabor features selected by the SSMES, SFisher and SHK procedures. Clearly, different scales and orientations contribute differently, and the distribution of the selected features is broadly consistent across the methods: scales 0 and 1 are more likely to be important than the other two scales, and most of the horizontal and vertical Gabor kernels extract stronger features than those with other orientations.

5.1.2 Classification results

The selected Gabor features are then used for face recognition. The classical classifiers NNC and FC are chosen to recognize the faces. As mentioned above, the proposed feature selection framework can itself perform the intra-personal versus extra-personal recognition task. Thus we also used it for face recognition by treating face recognition as a series of pair matching problems. However, in many situations more than one subject satisfies the separating condition. In order to make a final decision, we simply classify the unknown face as the subject whose samples maximize the linear discriminant function, i.e., the margin. In this sense, it can be seen as a maximum margin classifier (MMC).
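The decision rule described above can be sketched as follows, assuming the discriminant $g(x) = \omega_0 + \sum_i \omega_i x_i$ is evaluated on the absolute feature difference between the probe and each gallery image. The function name `mmc_identify` and its arguments are hypothetical.

```python
import numpy as np

def mmc_identify(probe, gallery, gallery_ids, a):
    """Assign the probe to the gallery subject with the largest discriminant value.

    probe:       (d,) feature vector of the unknown face.
    gallery:     (m, d) feature vectors of the gallery images.
    gallery_ids: subject identity for each gallery image.
    a:           augmented weight vector [w0, w1, ..., wd] from feature selection.
    """
    diffs = np.abs(gallery - probe)          # pairwise difference features
    scores = a[0] + diffs @ a[1:]            # g(x) for every probe-gallery pair
    best = int(np.argmax(scores))            # largest value wins, even if several are > 0
    return gallery_ids[best], scores
```

Since the weight vector is sparse, only the selected feature components contribute to each score.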

We also implemented three previous Gabor-based approaches for comparison. The first uses Gabor features without feature selection for face representation and NNC for recognition, and is denoted "G+NNC". The second method, "G+FC", denotes the GFC method in [20], i.e., PCA+LDA on down-sampled Gabor features. The third method, "G Ada+FC", is the AGFC method in [26], which uses Adaboost to select Gabor features and FC for classification. For clarity, "G SSMES+NNC", "G SFisher+NNC" and "G SHK+NNC" respectively denote the methods using the SSMES, SFisher and SHK procedures to select Gabor features and NNC for recognition. Similarly, for the other two classifiers, the corresponding methods are denoted "G SSMES+MMC", "G SFisher+MMC", "G SHK+MMC" and "G SSMES+FC", "G SFisher+FC", "G SHK+FC". We investigated three kinds of distance measures, namely the $\ell_1$ distance, the $\ell_2$ distance and the cosine distance, and found that the $\ell_1$ distance achieves the best performance for NNC, while the cosine distance performs best for FC. Thus we selected the $\ell_1$ distance for NNC and the cosine distance for FC. In our implementation, the Gabor features used in "G+FC" are down-sampled to a dimension of 8,192, and in "G Ada+FC" 2,000 Gabor features are selected by Adaboost. The optimal dimensions for PCA and LDA are determined by testing all possible dimensions. The results on 5 different probe sets are shown and compared in Table 1.



TABLE 1
Recognition performance comparison on different CAS-PEAL-R1 probe sets (%)

Methods          Aging    Expression  Distance  Background  Accessory
G+NNC            78.79    81.34       98.18     93.49       63.63
G+FC             96.97    92.68       98.91     98.19       84.55
G SSMES+NNC      87.88    78.79       95.64     94.94       68.32
G SSMES+MMC       7.58     8.47       17.45     19.71        5.38
G SSMES+FC      100.00    89.94       98.18     97.47       79.34
G SFisher+NNC    93.94    77.26       96.00     94.94       68.45
G SFisher+MMC    74.24    74.97       93.82     91.68       49.85
G SFisher+FC    100.00    90.32       97.82     97.65       80.88
G SHK+NNC        93.94    79.87       96.36     95.12       69.98
G SHK+MMC        75.76    77.01       94.54     92.22       52.12
G SHK+FC        100.00    91.34       98.91     98.92       80.96
G Ada+FC         96.97    89.81       96.73     97.83       78.77



From Table 1, we can make several major observations. First, although the proposed feature selection framework also establishes a classifier that can be used straightforwardly for face recognition, its performance is not as satisfactory as expected, especially for the "G SSMES+MMC" method. Our explanation is that although the feature selection framework can effectively select the meaningful features, it may overestimate or underestimate the corresponding weights, leading to over-fitting. Compared to "G SSMES+MMC", the classifiers used in "G SFisher+MMC" and "G SHK+MMC" both consider the bias and the margin and thus achieve better results. The second observation is that the FC-based methods ("G+FC", "G SSMES+FC", "G SFisher+FC", "G SHK+FC" and "G Ada+FC") perform much better than the methods based on the other two classifiers. In general, the algorithms with the regularization-based feature selection procedure use only 500 Gabor features, yet slightly outperform "G Ada+FC" with 2,000 Gabor features selected by Adaboost and are comparable to "G+FC" using 8,192 Gabor features, which shows that the proposed feature selection framework is effective for face recognition. Third, SFisher and SHK perform better than SSMES in both feature selection and classification. These results indicate that considering the bias and the margin does not make the learning process overfit the training data but instead improves generalization.

5.2 Results on LFW database 

We also conducted experiments on the LFW database for further investigation. Unlike the CAS-PEAL-R1 database, the LFW database has a larger degree of variability, and recognition is performed only by pair matching instead of by searching for the most similar face in the database. We again followed the database protocol, which provides two views: View 1 for model selection and algorithm development, and View 2 for performance reporting. View 1 specifies one training set containing 2,200 pairs and one testing set containing 1,000 pairs. View 2 consists of ten sets with 600 pairs each, which can be combined into 10 different training/testing set pairs. In our experiments, the training set of View 1 is used for training the feature selection model, and the performance is reported using 10-fold cross validation on View 2.
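For reference, a mean accuracy with its standard error over cross-validation folds can be computed as in the sketch below. It assumes the standard error is the sample standard deviation (with `ddof=1`) divided by the square root of the number of folds; the function name and any fold accuracies passed to it are hypothetical and are not results from this paper.

```python
import numpy as np

def mean_and_standard_error(fold_accuracies):
    """Mean accuracy and standard error of the mean across cross-validation folds."""
    acc = np.asarray(fold_accuracies, dtype=float)
    mean = acc.mean()
    std_err = acc.std(ddof=1) / np.sqrt(acc.size)   # sample std / sqrt(number of folds)
    return mean, std_err
```

For the 10-fold protocol on View 2, `fold_accuracies` would hold the ten per-fold accuracies.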

The proposed feature selection framework is used as a feature selector to select the 500 most informative Gabor features from the 131,072 original features. We adopted the proposed framework directly as a classifier to recognize the unknown pairs, and also combined it with an SVM classifier. The corresponding methods are referred to as "G SSMES", "G SFisher", "G SHK" and "G SSMES+SVM", "G SFisher+SVM", "G SHK+SVM", respectively. We also investigated the performance of the method "G+FC", which uses all 131,072 original features as the representation and FC as the classifier. The results of the experiments are given in Table 2, and the ROC curves of the different methods are compared in Fig. 3.

As can be seen, directly applying the proposed feature selection framework as a classifier performs somewhat worse than using an SVM classifier. Recall that the "G SFisher" algorithm actually performs a sparse Fisher classification that uses only 500 features, yet it achieves performance comparable to the "G+FC" method using all 131,072 original features, in terms of both accuracy and ROC curve (the accuracy is slightly lower, but the ROC performance is better). This phenomenon further demonstrates the effectiveness of the proposed feature selection framework. Again, SFisher and especially SHK perform better than SSMES in both feature selection and classification on this dataset, which can be attributed to the consideration of the bias and the adaptive margin in the linear model.

TABLE 2
Mean (± standard error) recognition accuracy on View 2 of the LFW data set (%)

Methods          Recognition accuracy
G+FC             67.10 ± 0.53
G SSMES          60.60 ± 0.64
G SFisher        66.70 ± 0.49
G SHK            68.27 ± 0.58
G SSMES+SVM      68.30 ± 0.59
G SFisher+SVM    69.18 ± 0.52
G SHK+SVM        70.32 ± 0.44

Fig. 3. ROC curve comparison on View 2 of the LFW data set. Each point on the curve represents the average over the 10 folds of (false positive rate, true positive rate) for a fixed threshold.

6 Conclusion 

We have presented a novel feature selection algorithm for face recognition based on well-grounded sparsity-enforcing regularization techniques. We cast the feature selection problem into a combinatorial sparse approximation problem by enforcing a sparsity penalty term on the MSE criterion, which can be solved by greedy methods or convex relaxation methods. Moreover, we introduced the sparsity constraint into the traditional HK procedure and proposed a sparse HK procedure to obtain simultaneously the optimal sparse solution and the corresponding margin vector of the MSE criterion. The proposed framework was applied to select the most informative Gabor features for face recognition, and the experimental results on the CAS-PEAL-R1 and LFW face databases compare favorably with previous state-of-the-art Gabor-based methods. Our future work includes exploring other, more effective low-level face representations and more sophisticated classification strategies to produce better performance.